Poster: pegz Date: November 11, 2012 03:33:12am
Forum: texts Subject: Re: Omni Magazine - any proof reading?

Firstly, thanks for the apology. I'm sorry if I went a bit overboard; it was probably mainly due to embarrassment! Like many others, I've been spreading the word on several forums about Omni being available again but, alas, without trying to read one first. Maybe I should do some proof reading too :~) I guess my love of I.A. and Omni got the better of me.
Secondly, when I say 'proof reading', I don't expect perfection. I just would have thought that someone might have glanced at the first page of the first issue to be converted (the one I pasted above) and thought "Hang on, something's not quite right here..." before ploughing on through the rest.

Poster: pegz Date: November 11, 2012 03:58:05am
Forum: texts Subject: Re: Omni Magazine - any proof reading?

...anyway, if the proof reading was perfect, I'd miss out on glorious lines such as 'Life is adrift in a sea of Radox'!
I know from the context it should be 'radiation', but it's so much more relaxing to think of it as drifting in pine-scented bath salts...

Poster: aibek Date: November 11, 2012 06:09:41am
Forum: texts Subject: Re: Omni Magazine - any proof reading?

The OCR software could delete the obvious gibberish in the text file. It is not difficult for the software to identify it -- just notice when there are very few dictionary words amongst all the characters on the page. This would take care of the cases where whole files contain no useful words! (e.g. with Sanskrit texts.)

But it is computationally expensive! What is not even worthy of notice for one file -- like deleting obvious gibberish -- becomes something to consider when you have hundreds or thousands of files, even if the algorithm is straightforward. In the IA case, with all sorts of “derivations” of texts, audio and video running for probably thousands of files simultaneously (all of which are computationally intensive), the developers are probably not free to do all that they may wish to!

But if the engineers have not considered this, someone should draw their attention to it! Note that the unedited text file is not a “faithful representation” that must not be touched; the PDF file, or more exactly, the tiff images, are the “faithful representation”.
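
To make the idea above concrete, here is a rough Python sketch of the dictionary-word check. It is only an illustration under stated assumptions, not anything the IA or the OCR software is known to do: the names `SAMPLE_WORDS`, `dictionary_ratio` and `is_gibberish`, the tiny word list, and the 0.3 threshold are all made up for the example.

```python
import re

# Hypothetical stand-in for a real word list; an actual run would load a
# full dictionary for the language of the scanned text.
SAMPLE_WORDS = {"life", "is", "adrift", "in", "a", "sea", "of", "radiation", "the", "and"}

# Runs of ASCII letters are treated as candidate words (an assumption that
# only suits Latin-script text).
TOKEN_RE = re.compile(r"[A-Za-z]+")


def dictionary_ratio(page_text, words=SAMPLE_WORDS):
    """Return the fraction of alphabetic tokens that appear in the word list."""
    tokens = [t.lower() for t in TOKEN_RE.findall(page_text)]
    if not tokens:
        # A page with no alphabetic tokens has no useful words at all.
        return 0.0
    hits = sum(1 for t in tokens if t in words)
    return hits / len(tokens)


def is_gibberish(page_text, words=SAMPLE_WORDS, threshold=0.3):
    """Flag a page when too few of its tokens are dictionary words.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    return dictionary_ratio(page_text, words) < threshold
```

A check like this only catches pages that are mostly noise. A page with a single wrong word, such as 'Life is adrift in a sea of Radox', still scores high and passes, so it is no substitute for proof reading.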

This post was modified by aibek on 2012-11-11 14:09:41

Poster: Jeff Kaplan (Administrator, Curator, or Staff) Date: November 11, 2012 08:55:44am
Forum: texts Subject: Re: Omni Magazine - any proof reading?

We're well aware that, to be generous, OCR is less than perfect. And you're correct that at scale we would not be able to manually review or correct the ABBYY OCR. At this point the main suggestions are to crowdsource (which we do not have the manpower to manage) or to have interested folks upload corrected files to new items and use good metadata so that the corrected file items appear high in search results.